fix: handle missing dictionary batch for null-only columns in IPC reader#9623

Merged
alamb merged 4 commits into apache:main from joaquinhuigomez:fix/ipc-null-dictionary
Apr 8, 2026

@joaquinhuigomez
Contributor

Which issue does this PR close?

Rationale for this change

The IPC specification states:

An edge-case for interleaved dictionary and record batches occurs when the record batches contain dictionary encoded arrays that are completely null. In this case, the dictionary for the encoded column might appear after the first record batch.

Arrow C++ (v17+) relies on this and does not emit a dictionary batch when all values in a dictionary-encoded column are null. The Rust IPC reader currently fails with "Cannot find a dictionary batch with dict id: ..." when reading such streams, which breaks cross-language interop for this edge case.

What changes are included in this PR?

When the IPC reader encounters a Dictionary-typed column whose dict_id has no corresponding entry in dictionaries_by_id, it now synthesizes an empty values array of the appropriate type (via new_empty_array) instead of returning an error. This matches the spec's allowance for omitted dictionary batches on null-only columns.
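The fallback described above can be modelled with the standard library alone. The following is a minimal sketch, not the arrow-rs implementation: `DictColumn` and `resolve_dictionary` are hypothetical names, a `Vec<String>` stands in for the dictionary values array, and the empty `Vec` plays the role that `new_empty_array` plays in the real reader. The bounds check on non-null keys is an assumption of this sketch and is not taken from the PR.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a dictionary-encoded column: a `None` key is a
/// null slot, a `Some(i)` key is an index into `values`.
#[derive(Debug, PartialEq)]
struct DictColumn {
    keys: Vec<Option<usize>>,
    values: Vec<String>,
}

/// Hypothetical helper: look up the dictionary for `dict_id`, falling back to
/// an empty values array when no dictionary batch was ever received.
fn resolve_dictionary(
    dictionaries_by_id: &HashMap<i64, Vec<String>>,
    dict_id: i64,
    keys: Vec<Option<usize>>,
) -> Result<DictColumn, String> {
    // Missing entry: synthesize empty values instead of returning an error.
    let values = dictionaries_by_id
        .get(&dict_id)
        .cloned()
        .unwrap_or_default();

    // Assumption of this sketch: a non-null key that points past the end of
    // the dictionary is still rejected, so only all-null columns can decode
    // against the synthesized empty dictionary.
    if let Some(bad) = keys.iter().flatten().find(|&&k| k >= values.len()) {
        return Err(format!(
            "key {bad} out of bounds for dictionary {dict_id} with {} values",
            values.len()
        ));
    }

    Ok(DictColumn { keys, values })
}
```

With an empty `dictionaries_by_id`, calling `resolve_dictionary(&dicts, 7, vec![None, None])` succeeds with empty `values`, while any `Some(_)` key is reported as an error.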

Are these changes tested?

Yes. A new test (test_read_null_dict_without_dictionary_batch) writes an IPC stream with an all-null dictionary column, strips the dictionary batch message from the raw bytes to simulate C++ behavior, then verifies the Rust reader successfully decodes the stream.

Are there any user-facing changes?

IPC streams produced by C++ (or other implementations) that omit dictionary batches for null-only dictionary columns can now be read without error. Previously these streams caused a ParseError.

Per the IPC specification, dictionary batches may be omitted when all
values in a dictionary-encoded column are null.  The C++ implementation
(Arrow C++ 17+) relies on this and does not emit a dictionary batch in
such cases, which caused the Rust IPC reader to fail with:
"Cannot find a dictionary batch with dict id: ..."

When a dictionary batch is missing, synthesize an empty values array of
the appropriate type so decoding can proceed.
@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 29, 2026
@alamb
Contributor

alamb commented Apr 6, 2026

Looks like this was added to the spec 7 years ago:

Contributor

@alamb alamb left a comment

Thank you for the contribution @joaquinhuigomez -- this looks good to me

I double checked that this matches the spec and that this was added to the spec 7 years ago in apache/arrow@0ddc1f4 / apache/arrow#5585

cc @pierrebelzile as the filer of

I also took the liberty of merging up from main and fixing clippy and fmt and simplifying the test a bit

Comment thread arrow-ipc/src/reader.rs Outdated
// Each message is: [continuation (4 bytes)] [meta_len (4 bytes)]
// [metadata (meta_len bytes)] [body (bodyLength bytes)]
let mut header = [0u8; 4];
if std::io::Read::read_exact(&mut cursor, &mut header).is_err() {
Contributor

this is a strange way to write read_exact -- most other times I see it like

 cursor.read_exact(&mut header).is_err() {

Contributor

I took the liberty of pushing a commit to simplify it
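For readers following the thread, here is a rough sketch of the method-call style suggested above, applied to the message prefix described in the code comment (a 4-byte continuation marker followed by a 4-byte little-endian metadata length). It is not the code from the PR: `read_metadata_len` and `CONTINUATION_MARKER` are hypothetical names, and the sketch stops before the flatbuffer metadata, which is where the body length would have to be read from.

```rust
use std::io::{self, Read};

/// Continuation marker (0xFFFFFFFF) that precedes each encapsulated message.
const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];

/// Hypothetical helper: read the 8-byte prefix of one message and return the
/// metadata length, or `None` if the input ends before a full marker is read.
fn read_metadata_len<R: Read>(reader: &mut R) -> io::Result<Option<i32>> {
    let mut header = [0u8; 4];
    // Method-call form, rather than `std::io::Read::read_exact(&mut reader, ..)`.
    match reader.read_exact(&mut header) {
        Ok(()) => {}
        // Running out of bytes here is treated as the end of the input.
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    if header != CONTINUATION_MARKER {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "missing continuation marker",
        ));
    }

    // The metadata length is a little-endian i32; a value of 0 marks the
    // end of the stream.
    let mut len_bytes = [0u8; 4];
    reader.read_exact(&mut len_bytes)?;
    Ok(Some(i32::from_le_bytes(len_bytes)))
}
```

Bringing `Read` into scope with `use std::io::Read` is what makes `reader.read_exact(..)` resolve as a method, which is the simplification the review asks for.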

@alamb alamb merged commit 48727b3 into apache:main Apr 8, 2026
26 checks passed
@alamb
Contributor

alamb commented Apr 8, 2026

Thanks again @joaquinhuigomez

Labels

arrow Changes to the arrow crate

Development

Successfully merging this pull request may close these issues.

IPC reader: handling of dictionaries with only null values

2 participants